1 Project Introduction

On January 12, 2010, a magnitude 7.0 earthquake struck Haiti, causing significant damage and affecting approximately 3 million people. In the wake of the disaster, aid groups worked to locate displaced persons and provide them with food and water. However, because infrastructure was destroyed on a large scale over a wide area, additional assistance was needed to locate people quickly.

Little is left of a neighborhood on a hillside near downtown Port-au-Prince on Jan. 15. More than a million people were displaced by the quake. (David Gilkey/NPR)

Displaced persons were known to be building makeshift shelters out of blue tarps. High-resolution geo-referenced images of the destroyed areas were captured from aircraft. The volume of data generated by the image collection was too large for aid workers to process in time to supply aid. Therefore, a team from the Rochester Institute of Technology used data-mining algorithms to analyze the images and identify blue tarps. The goal was to locate displaced persons and communicate their locations to rescue workers so they could get resources to the people who needed them in time.

Sample geo-referenced image used for the analysis

As the final project for SYS 6018 - Data Mining, we were assigned to build models using the different techniques we learned in the course to locate, as accurately and as quickly as possible, as many of the displaced persons identified in the imagery data as we could, so that they could be provided food and water before their situations became unsurvivable. The data made available to students consisted of a CSV file of red, green, and blue pixel values. A final hold-out data set, provided as multiple text files, was supplied as well.

2 Risk Analysis / Cost Benefit / Budget for Project

The US Government spent $1.5B on Haiti disaster relief by the end of 2010. A cost-benefit analysis for the project must weigh:

- method of delivery
- cost of resources
- time from disaster to unsurvivable conditions

3 Analysis Methods

3.1 EDA

To begin the analysis, the data set was loaded into a data frame and checked for any NA values that would have to be addressed. The data was already clean and did not contain any NA values.

3.1.1 Check NA

df <- tibble(read.csv("HaitiPixels.csv")) #read in df
"Check for NA values" 
#> [1] "Check for NA values"
anyNA(df) #check for NA values 
#> [1] FALSE
summary(df) #quick look at data
#>     Class                Red          Green            Blue      
#>  Length:63241       Min.   : 48   Min.   : 48.0   Min.   : 44.0  
#>  Class :character   1st Qu.: 80   1st Qu.: 78.0   1st Qu.: 63.0  
#>  Mode  :character   Median :163   Median :148.0   Median :123.0  
#>                     Mean   :163   Mean   :153.7   Mean   :125.1  
#>                     3rd Qu.:255   3rd Qu.:226.0   3rd Qu.:181.0  
#>                     Max.   :255   Max.   :255.0   Max.   :255.0

The data provided for analysis was generated from overhead images and stored as a three channel output. The channels represented the red, green, and blue values for pixels within images. RGB color model is referred to as an additive model. The integer value for the red, green, and blue channels are combined to represent a color. Typically, the component values are stored as an 8 bit integer ranging from 0 to 255.
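As a small illustration of the additive model (not part of the project code), R's built-in grDevices::rgb() function combines the three 8-bit channel values into a single hex color:

```r
# Combine 8-bit R, G, B components into hex color strings (additive model)
pure_blue <- rgb(0, 0, 255, maxColorValue = 255)
pure_blue
#> [1] "#0000FF"

# A muted, tarp-like blue: moderate red/green with a dominant blue channel
tarp_like <- rgb(60, 120, 200, maxColorValue = 255)
tarp_like
#> [1] "#3C78C8"
```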

The data was visualized with the ggpairs function. For each pair of variables in the data frame, ggpairs generates a scatterplot, displays a Pearson correlation, and, on the diagonal, shows each variable's distribution. The plots were also color-coded by class, where the class describes what kind of object a pixel is associated with. Our data frame contained the following classes: Blue Tarp, Rooftop, Soil, Various Non-tarp, and Vegetation.

Inspection of the generated figure indicated that …

3.1.2 Scatter and Correlation

#Reference [1]
# The palette with grey:
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
# To use for fills, add
#scale_fill_manual(values=cbPalette)
  
df$Class <- factor(df$Class) #make Class a factor variable
#view scatterplots and correlations, colored by class
#note: ggpairs uses `lower = "blank"` to suppress the lower panels (base R's `lower.panel = NULL` is not a ggpairs argument)
ggpairs(df, columns = 2:4, mapping = aes(color = Class),
        lower = "blank",
        upper = list(continuous = wrap("cor", size = 3)))# + scale_fill_manual(values=cbPalette)

attach(df) #attach df variables 

TODO: If time permits, add a selector to choose a color scheme for viewers who are not color blind or who have different kinds of color blindness. Reference: https://socviz.co/refineplots.html

3.1.3 3D Scatter

fig <- plot_ly(df, x=~Red, y=~Green, z=~Blue, color=~Class) #Reference: https://plotly.com/r/3d-scatter-plots/ https://plotly.com/r/3d-surface-plots/
fig <- fig %>% add_markers()
fig <- fig %>% layout(scene=list(xaxis=list(title="Red"),
                     yaxis = list(title = 'Green'),
                     zaxis = list(title = 'Blue')))

fig

3.1.4 Boxplots

TODO: Add comments about this figure or remove it…

#take a look at the distribution of classes and RGB values 
featurePlot(x = df[,2:4],
            y = df$Class,
            plot = "box",
            layout = c(3,1), 
            scales = list(y = list(relation = "free"),
                          x = list(rot = 90)))

3.2 Prepare Data Frame for Analysis

The goal of this project is to identify blue tarps versus non-blue tarps … Convert the Class column into a binary categorical variable indicating whether or not a pixel represents a blue tarp.

df <- cbind(mutate(df, "Blue_Tarp_or_Not"=ifelse(Class != "Blue Tarp", 0, 1))) #add binary column indicating whether the Class variable is "Blue Tarp" or not
attach(df)
df$Blue_Tarp_or_Not <- factor(Blue_Tarp_or_Not, labels = c("NBT", "BT"))#, levels =c(0,1), labels = c("NBT", "BT")) #ensure new column is a factor 
head(df)
#>        Class Red Green Blue Blue_Tarp_or_Not
#> 1 Vegetation  64    67   50              NBT
#> 2 Vegetation  64    67   50              NBT
#> 3 Vegetation  64    66   49              NBT
#> 4 Vegetation  75    82   53              NBT
#> 5 Vegetation  74    82   54              NBT
#> 6 Vegetation  72    76   52              NBT
df_factor  <- df[, -1]
tail(df_factor)
#>       Red Green Blue Blue_Tarp_or_Not
#> 63236 136   145  150               BT
#> 63237 138   146  150               BT
#> 63238 134   141  152               BT
#> 63239 136   143  151               BT
#> 63240 132   139  149               BT
#> 63241 133   141  153               BT
#To make run times faster when tuning parameters, subset the data to a stratified 20% sample
trainIndex <- createDataPartition(df_factor$Blue_Tarp_or_Not, p=0.2,
                                  list=FALSE,
                                  times=1)
df_subset <- df_factor[trainIndex,]
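Because createDataPartition samples within each class, the 20% subset should preserve the roughly 3% blue-tarp prevalence of the full data. A quick sanity check, sketched here on a synthetic factor with a similar class imbalance (in practice, compare df_factor and df_subset directly):

```r
library(caret)

# Synthetic stand-in with ~3% positives, mimicking the blue-tarp imbalance
set.seed(4)
y <- factor(c(rep("NBT", 970), rep("BT", 30)), levels = c("NBT", "BT"))

idx <- createDataPartition(y, p = 0.2, list = FALSE, times = 1)

# Stratified sampling keeps the class proportions close to the original
prop.table(table(y))       # full-data balance
prop.table(table(y[idx]))  # subset balance (should be similar)
```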

3.3 Logistic Regression

Fit a logistic regression model. The trainControl call below performs 10-fold cross-validation and saves the per-fold predictions and class probabilities.

#pass
fitControl <- trainControl(method = "cv",
                           number = 10,
                           returnResamp = 'all',
                           savePredictions = 'final',
                           classProbs = TRUE) 

set.seed(4)
glm.fit <- caret::train(Blue_Tarp_or_Not~Red+Green+Blue,
                    data = df_subset, #df_factor,
                    method="glm",
                    family="binomial",
                    trControl= fitControl)

glm.fit
#> Generalized Linear Model 
#> 
#> 12649 samples
#>     3 predictor
#>     2 classes: 'NBT', 'BT' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (10 fold) 
#> Summary of sample sizes: 11384, 11385, 11384, 11383, 11383, 11384, ... 
#> Resampling results:
#> 
#>   Accuracy   Kappa   
#>   0.9944654  0.904548
summary(glm.fit)
#> 
#> Call:
#> NULL
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -3.2342  -0.0275  -0.0031   0.0000   3.5742  
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)    
#> (Intercept) -0.06950    0.37933  -0.183    0.855    
#> Red         -0.22208    0.02322  -9.563   <2e-16 ***
#> Green       -0.21367    0.02582  -8.277   <2e-16 ***
#> Blue         0.43111    0.02987  14.432   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 3584.46  on 12648  degrees of freedom
#> Residual deviance:  421.77  on 12645  degrees of freedom
#> AIC: 429.77
#> 
#> Number of Fisher Scoring iterations: 11
glm.fit$resample #point est +/- std from 10 folds "variation in the third decimal place"... 
#>     Accuracy     Kappa parameter Resample
#> 1  0.9952569 0.9261242      none   Fold01
#> 2  0.9960443 0.9313043      none   Fold02
#> 3  0.9960474 0.9313060      none   Fold03
#> 4  0.9984202 0.9747934      none   Fold04
#> 5  0.9928910 0.8731096      none   Fold05
#> 6  0.9952569 0.9225526      none   Fold06
#> 7  0.9928854 0.8763669      none   Fold07
#> 8  0.9912975 0.8448768      none   Fold08
#> 9  0.9912975 0.8406674      none   Fold09
#> 10 0.9952569 0.9243783      none   Fold10

Assess model performance on the training data to select threshold values…

#pass
glm.prob <- predict(glm.fit, newdata=df_subset, type = "prob") #returns a data frame with columns for P(NBT) and P(BT)
par(pty="s")
glm_roc <- roc(df_subset$Blue_Tarp_or_Not, glm.prob[,2], plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#965fd4", lwd=4, print.auc=TRUE, main="GLM ROC Curve") 

roc.info_glm <- roc(df_subset$Blue_Tarp_or_Not, glm.prob[,2], legacy.axes=TRUE)
roc.glm.df <- data.frame(tpp=roc.info_glm$sensitivities*100, fpp=(1-roc.info_glm$specificities)*100, thresholds=roc.info_glm$thresholds)
#roc.glm.df[roc.glm.df>98.5 & roc.glm.df < 99,]

glm.thresholds <- data.matrix(roc.glm.df$thresholds)

fig <- plot_ly(roc.glm.df, x=~tpp, y=~fpp, z=~thresholds) #Reference: https://plotly.com/r/3d-scatter-plots/ https://plotly.com/r/3d-surface-plots/
fig <- fig %>% add_markers()
fig <- fig %>% layout(scene=list(xaxis=list(title="True Positive Rate"),
                     yaxis = list(title = 'False Positive Rate'),
                     zaxis = list(title = 'Threshold')))

fig
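As a complement to reading thresholds off the plots above, pROC's coords() can propose an operating point directly, e.g. the threshold maximizing Youden's J (sensitivity + specificity - 1). A sketch on synthetic scores; in practice, substitute the roc object built from glm.prob[,2] and df_subset$Blue_Tarp_or_Not:

```r
library(pROC)

# Synthetic scores: negatives cluster low, positives cluster high
set.seed(4)
labels <- factor(c(rep("NBT", 200), rep("BT", 50)), levels = c("NBT", "BT"))
scores <- c(rnorm(200, 0.2, 0.1), rnorm(50, 0.8, 0.1))

r <- roc(labels, scores, legacy.axes = TRUE)

# "best" with Youden's J returns the threshold maximizing sens + spec - 1
coords(r, "best", best.method = "youden",
       ret = c("threshold", "sensitivity", "specificity"))
```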
lr.thresh <- 0.5
glm.pred_thresh <- factor(ifelse(glm.prob[,2]>lr.thresh,"BT", "NBT"), levels=c("NBT", "BT"))
cm.glm_thresh <- confusionMatrix(glm.pred_thresh, df_subset$Blue_Tarp_or_Not, positive = "BT") 
"Threshold: 0.5"
#> [1] "Threshold: 0.5"
cm.glm_thresh
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   NBT    BT
#>        NBT 12230    54
#>        BT     14   351
#>                                           
#>                Accuracy : 0.9946          
#>                  95% CI : (0.9932, 0.9958)
#>     No Information Rate : 0.968           
#>     P-Value [Acc > NIR] : < 2.2e-16       
#>                                           
#>                   Kappa : 0.9089          
#>                                           
#>  Mcnemar's Test P-Value : 2.251e-06       
#>                                           
#>             Sensitivity : 0.86667         
#>             Specificity : 0.99886         
#>          Pos Pred Value : 0.96164         
#>          Neg Pred Value : 0.99560         
#>              Prevalence : 0.03202         
#>          Detection Rate : 0.02775         
#>    Detection Prevalence : 0.02886         
#>       Balanced Accuracy : 0.93276         
#>                                           
#>        'Positive' Class : BT              
#> 
acc_LR <- cm.glm_thresh[["overall"]][["Accuracy"]]*100
auc_LR <- glm_roc[["auc"]]
thresh_LR <- lr.thresh
sens_LR <-  cm.glm_thresh[["byClass"]][["Sensitivity"]]*100
spec_LR <- cm.glm_thresh[["byClass"]][["Specificity"]]*100
FDR_LR <- ((cm.glm_thresh[["table"]][2,1])/(cm.glm_thresh[["table"]][2,1]+cm.glm_thresh[["table"]][2,2]))*100
prec_LR <- cm.glm_thresh[["byClass"]][["Precision"]]*100
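The metric extraction above is repeated verbatim for each classifier below; a small helper function (a sketch, not part of the original analysis) could collect the same quantities in one call:

```r
# Hypothetical helper: gather the report metrics from a caret confusionMatrix
# result (cm), a pROC roc object (roc_obj), and the threshold used
collect_metrics <- function(cm, roc_obj, thresh) {
  tab <- cm$table  # rows = predictions, cols = reference; row 2 is predicted "BT"
  list(
    accuracy    = cm$overall[["Accuracy"]] * 100,
    auc         = as.numeric(roc_obj$auc),
    threshold   = thresh,
    sensitivity = cm$byClass[["Sensitivity"]] * 100,
    specificity = cm$byClass[["Specificity"]] * 100,
    fdr         = tab[2, 1] / sum(tab[2, ]) * 100,  # FP / (FP + TP)
    precision   = cm$byClass[["Precision"]] * 100
  )
}

# e.g. metrics_LR <- collect_metrics(cm.glm_thresh, glm_roc, lr.thresh)
```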

3.4 LDA

#pass
fitControl <- trainControl(method = "cv",
                           number = 10,
                           returnResamp = 'all',
                           savePredictions = 'final',
                           classProbs = TRUE)

set.seed(4)
lda.fit <- caret::train(Blue_Tarp_or_Not~Red+Green+Blue,
                    data = df_subset, #df_factor,,
                    preProcess=c("center","scale"),
                    method="lda",
                    verbose= FALSE,
                    trControl= fitControl)

lda.fit
#> Linear Discriminant Analysis 
#> 
#> 12649 samples
#>     3 predictor
#>     2 classes: 'NBT', 'BT' 
#> 
#> Pre-processing: centered (3), scaled (3) 
#> Resampling: Cross-Validated (10 fold) 
#> Summary of sample sizes: 11384, 11385, 11384, 11383, 11383, 11384, ... 
#> Resampling results:
#> 
#>   Accuracy   Kappa    
#>   0.9839518  0.7512397
#summary(lda.fit)
#pass
lda.prob <- predict(lda.fit, newdata=df_subset, type = "prob") #returns a data frame with columns for P(NBT) and P(BT)
par(pty="s")
lda_roc <- roc(df_subset$Blue_Tarp_or_Not, lda.prob[,2], plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#965fd4", lwd=4, print.auc=TRUE, main="LDA ROC Curve") 

lda.thresh <- 0.5
lda.pred_thresh <- factor(ifelse(lda.prob[,2]>lda.thresh,"BT", "NBT"), levels=c("NBT", "BT"))
cm.lda_thresh <- confusionMatrix(factor(lda.pred_thresh),df_subset$Blue_Tarp_or_Not, positive = "BT") 
"Threshold: 0.5"
#> [1] "Threshold: 0.5"
cm.lda_thresh
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   NBT    BT
#>        NBT 12127    83
#>        BT    117   322
#>                                           
#>                Accuracy : 0.9842          
#>                  95% CI : (0.9819, 0.9863)
#>     No Information Rate : 0.968           
#>     P-Value [Acc > NIR] : < 2e-16         
#>                                           
#>                   Kappa : 0.7549          
#>                                           
#>  Mcnemar's Test P-Value : 0.01962         
#>                                           
#>             Sensitivity : 0.79506         
#>             Specificity : 0.99044         
#>          Pos Pred Value : 0.73349         
#>          Neg Pred Value : 0.99320         
#>              Prevalence : 0.03202         
#>          Detection Rate : 0.02546         
#>    Detection Prevalence : 0.03471         
#>       Balanced Accuracy : 0.89275         
#>                                           
#>        'Positive' Class : BT              
#> 
acc_lda <- cm.lda_thresh[["overall"]][["Accuracy"]]*100
auc_lda <- lda_roc[["auc"]]
thresh_lda <- lda.thresh
sens_lda <-  cm.lda_thresh[["byClass"]][["Sensitivity"]]*100
spec_lda <- cm.lda_thresh[["byClass"]][["Specificity"]]*100
FDR_lda <- ((cm.lda_thresh[["table"]][2,1])/(cm.lda_thresh[["table"]][2,1]+cm.lda_thresh[["table"]][2,2]))*100
prec_lda <- cm.lda_thresh[["byClass"]][["Precision"]]*100

3.5 QDA

#pass
fitControl <- trainControl(method = "cv",
                           number = 10,
                           returnResamp = 'all',
                           savePredictions = 'final',
                           classProbs = TRUE)

set.seed(4)
qda.fit <- caret::train(Blue_Tarp_or_Not~Red+Green+Blue,
                    data = df_subset, #df_factor,,
                    preProcess=c("center","scale"),
                    method="qda",
                    verbose= FALSE,
                    trControl= fitControl)

qda.fit
#> Quadratic Discriminant Analysis 
#> 
#> 12649 samples
#>     3 predictor
#>     2 classes: 'NBT', 'BT' 
#> 
#> Pre-processing: centered (3), scaled (3) 
#> Resampling: Cross-Validated (10 fold) 
#> Summary of sample sizes: 11384, 11385, 11384, 11383, 11383, 11384, ... 
#> Resampling results:
#> 
#>   Accuracy   Kappa    
#>   0.9940701  0.8952008
#pass
qda.prob <- predict(qda.fit, newdata=df_subset, type = "prob") #returns a data frame with columns for P(NBT) and P(BT)
par(pty="s")
qda_roc <- roc(df_subset$Blue_Tarp_or_Not, qda.prob[,2], plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#965fd4", lwd=4, print.auc=TRUE, main="QDA ROC Curve") 

qda.thresh <- 0.5
qda.pred_thresh <- factor(ifelse(qda.prob[,2]>qda.thresh,"BT", "NBT"), levels=c("NBT", "BT"))
cm.qda_thresh <- confusionMatrix(qda.pred_thresh, df_subset$Blue_Tarp_or_Not, positive = "BT") 
"Threshold: 0.5"
#> [1] "Threshold: 0.5"
cm.qda_thresh
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   NBT    BT
#>        NBT 12240    70
#>        BT      4   335
#>                                           
#>                Accuracy : 0.9941          
#>                  95% CI : (0.9927, 0.9954)
#>     No Information Rate : 0.968           
#>     P-Value [Acc > NIR] : < 2.2e-16       
#>                                           
#>                   Kappa : 0.8975          
#>                                           
#>  Mcnemar's Test P-Value : 4.153e-14       
#>                                           
#>             Sensitivity : 0.82716         
#>             Specificity : 0.99967         
#>          Pos Pred Value : 0.98820         
#>          Neg Pred Value : 0.99431         
#>              Prevalence : 0.03202         
#>          Detection Rate : 0.02648         
#>    Detection Prevalence : 0.02680         
#>       Balanced Accuracy : 0.91342         
#>                                           
#>        'Positive' Class : BT              
#> 
acc_qda <- cm.qda_thresh[["overall"]][["Accuracy"]]*100
auc_qda <- qda_roc[["auc"]]
thresh_qda <- qda.thresh
sens_qda <-  cm.qda_thresh[["byClass"]][["Sensitivity"]]*100
spec_qda <- cm.qda_thresh[["byClass"]][["Specificity"]]*100
FDR_qda <- ((cm.qda_thresh[["table"]][2,1])/(cm.qda_thresh[["table"]][2,1]+cm.qda_thresh[["table"]][2,2]))*100
prec_qda <- cm.qda_thresh[["byClass"]][["Precision"]]*100

3.6 KNN

#pass
fitControl <- trainControl(method = "cv",
                           number = 10,
                           returnResamp = 'all',
                           savePredictions = 'final',
                           classProbs = TRUE)

set.seed(4)
knn.fit <- train(Blue_Tarp_or_Not~Red+Green+Blue,
                    data = df_subset, #df_factor,,
                    preProcess=c("center","scale"),
                    method="knn",
                    trControl= fitControl,
                    tuneLength=5
                    )

knn.fit
#> k-Nearest Neighbors 
#> 
#> 12649 samples
#>     3 predictor
#>     2 classes: 'NBT', 'BT' 
#> 
#> Pre-processing: centered (3), scaled (3) 
#> Resampling: Cross-Validated (10 fold) 
#> Summary of sample sizes: 11384, 11385, 11384, 11383, 11383, 11384, ... 
#> Resampling results across tuning parameters:
#> 
#>   k   Accuracy   Kappa    
#>    5  0.9953353  0.9248807
#>    7  0.9957305  0.9311033
#>    9  0.9955726  0.9284574
#>   11  0.9957307  0.9308796
#>   13  0.9954146  0.9256031
#> 
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was k = 11.
#pass
knn.prob <- predict(knn.fit, newdata=df_subset, type = "prob") #returns a data frame with columns for P(NBT) and P(BT)
par(pty="s")
knn_roc <- roc(df_subset$Blue_Tarp_or_Not, knn.prob[,2], plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#965fd4", lwd=4, print.auc=TRUE, main="KNN ROC Curve") 

knn.thresh <- 0.5
knn.pred_thresh <- factor(ifelse(knn.prob[,2]>knn.thresh,"BT", "NBT"), levels=c("NBT", "BT"))
cm.knn_thresh <- confusionMatrix(knn.pred_thresh, df_subset$Blue_Tarp_or_Not, positive = "BT") 
"Threshold: 0.5"
#> [1] "Threshold: 0.5"
cm.knn_thresh
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   NBT    BT
#>        NBT 12219    19
#>        BT     25   386
#>                                           
#>                Accuracy : 0.9965          
#>                  95% CI : (0.9953, 0.9975)
#>     No Information Rate : 0.968           
#>     P-Value [Acc > NIR] : <2e-16          
#>                                           
#>                   Kappa : 0.9443          
#>                                           
#>  Mcnemar's Test P-Value : 0.451           
#>                                           
#>             Sensitivity : 0.95309         
#>             Specificity : 0.99796         
#>          Pos Pred Value : 0.93917         
#>          Neg Pred Value : 0.99845         
#>              Prevalence : 0.03202         
#>          Detection Rate : 0.03052         
#>    Detection Prevalence : 0.03249         
#>       Balanced Accuracy : 0.97552         
#>                                           
#>        'Positive' Class : BT              
#> 
acc_knn <- cm.knn_thresh[["overall"]][["Accuracy"]]*100
auc_knn <- knn_roc[["auc"]]
thresh_knn <- knn.thresh
sens_knn <-  cm.knn_thresh[["byClass"]][["Sensitivity"]]*100
spec_knn <- cm.knn_thresh[["byClass"]][["Specificity"]]*100
FDR_knn <- ((cm.knn_thresh[["table"]][2,1])/(cm.knn_thresh[["table"]][2,1]+cm.knn_thresh[["table"]][2,2]))*100
prec_knn <- cm.knn_thresh[["byClass"]][["Precision"]]*100
k_knn <- knn.fit[["bestTune"]][["k"]]

3.7 Random Forest

#pass
fitControl <- trainControl(method = "cv",
                           number = 10,
                           returnResamp = 'all',
                           savePredictions = 'final',
                           classProbs = TRUE)

set.seed(4)
rf.fit <- train(Blue_Tarp_or_Not~Red+Green+Blue,
                    data = df_subset, #df_factor,,
                    preProcess=c("center","scale"),
                    method="rf", #what is the difference between the different caret rf models??
                    trControl= fitControl,
                    tuneLength=3
                    )
#> note: only 2 unique complexity parameters in default grid. Truncating the grid to 2 .
rf.fit
#> Random Forest 
#> 
#> 12649 samples
#>     3 predictor
#>     2 classes: 'NBT', 'BT' 
#> 
#> Pre-processing: centered (3), scaled (3) 
#> Resampling: Cross-Validated (10 fold) 
#> Summary of sample sizes: 11384, 11385, 11384, 11383, 11383, 11384, ... 
#> Resampling results across tuning parameters:
#> 
#>   mtry  Accuracy   Kappa    
#>   2     0.9950983  0.9202957
#>   3     0.9948613  0.9162022
#> 
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 2.
#pass
RF.prob <- predict(rf.fit, newdata=df_subset, type = "prob") #returns a data frame with columns for P(NBT) and P(BT)
par(pty="s")
RF_roc <- roc(df_subset$Blue_Tarp_or_Not, RF.prob[,2], plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#965fd4", lwd=4, print.auc=TRUE, main="RF ROC Curve") 

RF.thresh <- 0.5
RF.pred_thresh <- factor(ifelse(RF.prob[,2]>RF.thresh,"BT", "NBT"), levels=c("NBT", "BT"))
cm.RF_thresh <- confusionMatrix(RF.pred_thresh, df_subset$Blue_Tarp_or_Not, positive = "BT") 
"Threshold: 0.5"
#> [1] "Threshold: 0.5"
cm.RF_thresh
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   NBT    BT
#>        NBT 12244     7
#>        BT      0   398
#>                                           
#>                Accuracy : 0.9994          
#>                  95% CI : (0.9989, 0.9998)
#>     No Information Rate : 0.968           
#>     P-Value [Acc > NIR] : < 2e-16         
#>                                           
#>                   Kappa : 0.991           
#>                                           
#>  Mcnemar's Test P-Value : 0.02334         
#>                                           
#>             Sensitivity : 0.98272         
#>             Specificity : 1.00000         
#>          Pos Pred Value : 1.00000         
#>          Neg Pred Value : 0.99943         
#>              Prevalence : 0.03202         
#>          Detection Rate : 0.03146         
#>    Detection Prevalence : 0.03146         
#>       Balanced Accuracy : 0.99136         
#>                                           
#>        'Positive' Class : BT              
#> 
acc_RF <- cm.RF_thresh[["overall"]][["Accuracy"]]*100
auc_RF <- RF_roc[["auc"]]
thresh_RF <- RF.thresh
sens_RF <-  cm.RF_thresh[["byClass"]][["Sensitivity"]]*100
spec_RF <- cm.RF_thresh[["byClass"]][["Specificity"]]*100
FDR_RF <- ((cm.RF_thresh[["table"]][2,1])/(cm.RF_thresh[["table"]][2,1]+cm.RF_thresh[["table"]][2,2]))*100
prec_RF <- cm.RF_thresh[["byClass"]][["Precision"]]*100

3.8 SVM

#pass
fitControl <- trainControl(method = "cv",
                           number = 10,
                           returnResamp = 'all',
                           savePredictions = 'final',
                           classProbs = TRUE)

set.seed(4)
svm.radial.fit <- train(Blue_Tarp_or_Not~Red+Green+Blue,
                    data = df_subset, #df_factor,,
                    preProcess=c("center","scale"),
                    method="svmRadial",
                    trControl= fitControl,
                    tuneLength=3
                    )

svm.radial.fit
#> Support Vector Machines with Radial Basis Function Kernel 
#> 
#> 12649 samples
#>     3 predictor
#>     2 classes: 'NBT', 'BT' 
#> 
#> Pre-processing: centered (3), scaled (3) 
#> Resampling: Cross-Validated (10 fold) 
#> Summary of sample sizes: 11384, 11385, 11384, 11383, 11383, 11384, ... 
#> Resampling results across tuning parameters:
#> 
#>   C     Accuracy   Kappa    
#>   0.25  0.9958886  0.9320385
#>   0.50  0.9958887  0.9321380
#>   1.00  0.9958097  0.9308594
#> 
#> Tuning parameter 'sigma' was held constant at a value of 8.537064
#> Accuracy was used to select the optimal model using the largest value.
#> The final values used for the model were sigma = 8.537064 and C = 0.5.
#pass
fitControl <- trainControl(method = "cv",
                           number = 10,
                           returnResamp = 'all',
                           savePredictions = 'final',
                           classProbs = TRUE)

set.seed(4)
svm.linear.fit <- train(Blue_Tarp_or_Not~Red+Green+Blue,
                    data = df_subset, #df_factor ,,
                    preProcess=c("center","scale"),
                    method="svmLinear",
                    trControl= fitControl,
                    tuneLength=3
                    )

svm.linear.fit
#pass
fitControl <- trainControl(method = "cv",
                           number = 10,
                           returnResamp = 'all',
                           savePredictions = 'final',
                           classProbs = TRUE)

set.seed(4)
svm.poly.fit <- train(Blue_Tarp_or_Not~Red+Green+Blue,
                    data = df_subset, #df_factor,,
                    preProcess=c("center","scale"),
                    method="svmPoly",
                    trControl= fitControl,
                    tuneLength=3
                    )

svm.poly.fit
#pass
SVM.prob <- predict(svm.radial.fit, newdata=df_subset, type = "prob") #returns a data frame with columns for P(NBT) and P(BT)
par(pty="s")
SVM_roc <- roc(df_subset$Blue_Tarp_or_Not, SVM.prob[,2], plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#965fd4", lwd=4, print.auc=TRUE, main="SVM ROC Curve") 

SVM.thresh <- 0.5
SVM.pred_thresh <- factor(ifelse(SVM.prob[,2]>SVM.thresh,"BT", "NBT"), levels=c("NBT", "BT"))
cm.SVM_thresh <- confusionMatrix(SVM.pred_thresh, df_subset$Blue_Tarp_or_Not, positive = "BT") 
"Threshold: 0.5"
#> [1] "Threshold: 0.5"
cm.SVM_thresh
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   NBT    BT
#>        NBT 12228    30
#>        BT     16   375
#>                                           
#>                Accuracy : 0.9964          
#>                  95% CI : (0.9952, 0.9973)
#>     No Information Rate : 0.968           
#>     P-Value [Acc > NIR] : < 2e-16         
#>                                           
#>                   Kappa : 0.9403          
#>                                           
#>  Mcnemar's Test P-Value : 0.05527         
#>                                           
#>             Sensitivity : 0.92593         
#>             Specificity : 0.99869         
#>          Pos Pred Value : 0.95908         
#>          Neg Pred Value : 0.99755         
#>              Prevalence : 0.03202         
#>          Detection Rate : 0.02965         
#>    Detection Prevalence : 0.03091         
#>       Balanced Accuracy : 0.96231         
#>                                           
#>        'Positive' Class : BT              
#> 
acc_SVM <- cm.SVM_thresh[["overall"]][["Accuracy"]]*100
auc_SVM <- SVM_roc[["auc"]]
thresh_SVM <- SVM.thresh
sens_SVM <-  cm.SVM_thresh[["byClass"]][["Sensitivity"]]*100
spec_SVM <- cm.SVM_thresh[["byClass"]][["Specificity"]]*100
FDR_SVM <- ((cm.SVM_thresh[["table"]][2,1])/(cm.SVM_thresh[["table"]][2,1]+cm.SVM_thresh[["table"]][2,2]))*100
prec_SVM <- cm.SVM_thresh[["byClass"]][["Precision"]]*100

For the polynomial-kernel SVM, the final values used for the model were degree = 3, scale = 0.1, and C = 1.

3.9 K-Folds Out of Sampling Performance

3.9.1 Table 2 - Performance Metrics: 10-Fold Cross-Validation Metrics

!!!!! THESE VALUES HAVE NOT BEEN UPDATED, POSSIBLY CONSIDER AUTOMATING THEM

|                   Method | KNN (k = 11) |     LDA     |     QDA     | Log. Regression | Random Forest (tuning param = ?) | SVM (tuning param = ?) |
|-------------------------:|:------------:|:-----------:|:-----------:|:---------------:|:--------------------------------:|:----------------------:|
|                 Accuracy | 99.6521464%  | 98.4188473% | 99.4149735% |   99.4624081%   |           99.9446597%            |      99.6363349%       |
|                      AUC | 99.9631566%  | 98.7970424% | 99.7916541% |   99.789577%    |           99.9625919%            |      99.9424359%       |
|                      ROC |              |             |             |                 |                                  |                        |
|                Threshold |     0.5      |     0.5     |     0.5     |       0.5       |               0.5                |          0.5           |
| Sensitivity=Recall=Power | 95.308642%   | 79.5061728% | 82.7160494% |   86.6666667%   |           98.2716049%            |      92.5925926%       |
|        Specificity=1-FPR | 99.7958184%  | 99.0444299% | 99.9673309% |   99.8856583%   |              100%                |      99.8693238%       |
|                      FDR | 6.0827251%   | 26.6514806% | 1.179941%   |   3.8356164%    |               0%                 |      4.0920716%        |
|            Precision=PPV | 93.9172749%  | 73.3485194% | 98.820059%  |   96.1643836%   |              100%                |      95.9079284%       |
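The note above suggests automating these values. Since each model's metrics are already stored in variables (acc_*, auc_*, sens_*, and so on), the table could be assembled programmatically; a sketch using knitr::kable, assuming those variables exist in the report environment:

```r
library(knitr)

# Build Table 2 from the per-model metric variables computed earlier
metrics <- data.frame(
  Method      = c("Log. Regression", "LDA", "QDA", "KNN", "Random Forest", "SVM"),
  Accuracy    = c(acc_LR, acc_lda, acc_qda, acc_knn, acc_RF, acc_SVM),
  AUC         = c(auc_LR, auc_lda, auc_qda, auc_knn, auc_RF, auc_SVM),
  Sensitivity = c(sens_LR, sens_lda, sens_qda, sens_knn, sens_RF, sens_SVM),
  Specificity = c(spec_LR, spec_lda, spec_qda, spec_knn, spec_RF, spec_SVM),
  FDR         = c(FDR_LR, FDR_lda, FDR_qda, FDR_knn, FDR_RF, FDR_SVM),
  Precision   = c(prec_LR, prec_lda, prec_qda, prec_knn, prec_RF, prec_SVM)
)
kable(metrics, digits = 4)
```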

3.10 Model Performance

3.10.1 Discussion

(discussion on FHO data why we do this… what the benefits are… potential pitfalls)

(discussion somewhere about ROC curves AUC and… other metrics)

3.11 Hold-Out Test Sample Performance

3.11.1 Table 3 - Performance Metrics: Hold-Out Test Data Set Scores

!!!!! THESE VALUES HAVE NOT BEEN UPDATED, POSSIBLY CONSIDER AUTOMATING THEM

#|                   Method | KNN (k = `r k_knn`) |    LDA    |    QDA    | Log. Regression | Random Forest (tuning param = ?) | SVM (tuning param = ?)|
#|-------------------------:|:--------------:|:---------:|:---------:|:---------------:|:--------------------------------:|:---------------------:|
#|                 Accuracy | `r acc_knn_FHO`%   | `r acc_lda_FHO`%   | `r acc_qda_FHO`%   | `r acc_LR_FHO`%   | `r acc_RF_FHO`    | `r acc_SVM_FHO`   |
#|                      AUC | `r auc_knn_FHO`%   | `r auc_lda_FHO`%   | `r auc_qda_FHO`%   | `r auc_LR_FHO`%   | `r auc_RF_FHO`    | `r auc_SVM_FHO`   |
#|                      ROC |                    |                    |                    |                   |                   |                   |
#|                Threshold | `r thresh_knn_FHO` | `r thresh_lda_FHO` | `r thresh_qda_FHO` | `r thresh_LR_FHO` | `r thresh_RF_FHO` |`r thresh_SVM_FHO` |
#| Sensitivity=Recall=Power | `r sens_knn_FHO`%  | `r sens_lda_FHO`%  | `r sens_qda_FHO`%  | `r sens_LR_FHO`%  |`r sens_RF_FHO`    | `r sens_SVM_FHO`  |
#|        Specificity=1-FPR | `r spec_knn_FHO`%  | `r spec_lda_FHO`%  | `r spec_qda_FHO`%  | `r spec_LR_FHO`%  |`r spec_RF_FHO`    |`r spec_SVM_FHO`   |
#|                      FDR | `r FDR_knn_FHO`%   | `r FDR_lda_FHO`%   | `r FDR_qda_FHO`%   | `r FDR_LR_FHO`%   | `r FDR_RF_FHO`    |`r FDR_SVM_FHO`    |
#|            Precision=PPV | `r prec_knn_FHO`%  | `r prec_lda_FHO`%  | `r prec_qda_FHO`%  | `r prec_LR_FHO`%  |`r prec_RF_FHO`    | `r prec_SVM_FHO`  |

3.12 Future Work

#Consider whether an additional data source, such as lidar or infrared imagery, could be paired with this data set to improve model performance... ?

4 References and Works Cited

5 Appendix A: Analysis Methods Reference Info

|                   | LDA | QDA |
|------------------:|:---:|:---:|
|       Assumptions |     |     |
| Tuning Parameters |     |     |